Autoencoders for natural language semantics
Autoencoders are artificial neural networks that learn representations. In an autoencoder, the
encoder transforms an input into a representation, and the decoder tries to recover the input
from the representation. This thesis compiles three different applications of these models to
natural language processing: for learning word and sentence representations, as well as to
better understand compositionality.
In the first paper, we show that we can autoencode dictionary definitions to learn word
vectors, called definition embeddings. We propose a new penalty that allows us to use these
definition embeddings as inputs to the encoder itself, but also to blend them with pretrained
distributional vectors. The definition embeddings capture semantic similarity better than
distributional methods such as word2vec. Moreover, the encoder somewhat generalizes to
definitions unseen during training.
In the second paper, we analyze the representations learned by sequence-to-sequence
variational autoencoders. We find that the encoders tend to memorize the first few words
and the length of the input sentence. This drastically limits their usefulness as controllable
generative models. We also analyze simpler architectural variants that are agnostic to word
order, as well as pretraining-based methods. The representations that they learn tend to
encode global features such as topic and sentiment more markedly, and this shows in the
reconstructions they produce.
In the third paper, we use language emergence simulations to study compositionality. A
speaker — the encoder — observes an input and produces a message about it. A listener — the
decoder — tries to reconstruct what the speaker talked about in its message. We hypothesize
that producing sentences involving several entities, such as "John loves Mary", fundamentally
requires perceiving each entity, John and Mary, as a distinct whole. We endow some agents
with this ability via an attention mechanism, and deprive others of it. We propose various
metrics to measure whether the languages are natural in terms of their argument structure,
and whether the languages are more analytic or synthetic. Agents perceiving entities as
distinct wholes exchange more natural messages than other agents.
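The encode-then-reconstruct loop that the abstract describes can be sketched in a few lines. The following is a minimal illustrative toy (a linear autoencoder trained by gradient descent on random data), not any of the thesis models; all shapes, learning rates, and variable names are invented for the example:

```python
import numpy as np

# Toy linear autoencoder: the encoder maps x -> z (a narrower representation),
# the decoder maps z -> x_hat, and both are trained to minimize the mean
# squared reconstruction error ||x_hat - x||^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                 # 200 toy inputs, 8 features each
W_enc = rng.normal(scale=0.1, size=(8, 3))    # encoder weights: 8 -> 3 bottleneck
W_dec = rng.normal(scale=0.1, size=(3, 8))    # decoder weights: 3 -> 8

def reconstruction_loss(W_enc, W_dec):
    X_hat = (X @ W_enc) @ W_dec               # encode, then decode
    return float(np.mean((X_hat - X) ** 2))

initial_loss = reconstruction_loss(W_enc, W_dec)
lr = 0.05
for _ in range(500):
    Z = X @ W_enc                             # representations ("codes")
    err = Z @ W_dec - X                       # reconstruction residual
    W_dec -= lr * (Z.T @ err) / len(X)        # gradient step on the decoder
    W_enc -= lr * (X.T @ (err @ W_dec.T)) / len(X)  # gradient step on the encoder
final_loss = reconstruction_loss(W_enc, W_dec)
```

Because the 3-dimensional bottleneck cannot represent all 8 input dimensions, the reconstruction error decreases during training but never reaches zero; that lossy compression is what forces the learned representation to be informative.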
DART: a Dataset of Arguments and their Relations on Twitter
The problem of understanding the stream of messages exchanged on social media such as Facebook and Twitter is becoming a major challenge for automated systems. The tremendous amount of data exchanged on these platforms, as well as the specific form of language adopted by social media users, constitutes a new challenging context for existing argument mining techniques. In this paper, we describe a resource of natural language arguments called DART (Dataset of Arguments and their Relations on Twitter), in which the complete argument mining pipeline over Twitter messages is considered: (i) we identify which tweets can be considered as arguments and which cannot, and (ii) we identify the relation, i.e., support or attack, linking such tweets to each other.
Learning GFlowNets from partial episodes for improved convergence and stability
Generative flow networks (GFlowNets) are a family of algorithms for training
a sequential sampler of discrete objects under an unnormalized target density
and have been successfully used for various probabilistic modeling tasks.
Existing training objectives for GFlowNets are either local to states or
transitions, or propagate a reward signal over an entire sampling trajectory.
We argue that these alternatives represent opposite ends of a gradient
bias-variance tradeoff and propose a way to exploit this tradeoff to mitigate
its harmful effects. Inspired by the TD(λ) algorithm in reinforcement
learning, we introduce subtrajectory balance, or SubTB(λ), a GFlowNet
training objective that can learn from partial action subsequences of varying
lengths. We show that SubTB(λ) accelerates sampler convergence in
previously studied and new environments and enables training GFlowNets in
environments with longer action sequences and sparser reward landscapes than
what was possible before. We also perform a comparative analysis of stochastic
gradient dynamics, shedding light on the bias-variance tradeoff in GFlowNet
training and the advantages of subtrajectory balance.Comment: ICML 202
Anti-tumour necrosis factor discontinuation in inflammatory bowel disease patients in remission: study protocol of a prospective, multicentre, randomized clinical trial
Background:
Patients with inflammatory bowel disease who achieve remission with anti-tumour necrosis factor (anti-TNF) drugs may have treatment withdrawn due to safety concerns and cost considerations, but there is a lack of prospective, controlled data investigating this strategy. The primary study aim is to compare the rates of clinical remission at 1 year in patients who discontinue anti-TNF treatment versus those who continue treatment.
Methods:
This is an ongoing, prospective, double-blind, multicentre, randomized, placebo-controlled study in patients with Crohn's disease or ulcerative colitis who have achieved clinical remission for ≥6 months with an anti-TNF treatment and an immunosuppressant. Patients are being randomized 1:1 to discontinue anti-TNF therapy or continue therapy. Randomization stratifies patients by the type of inflammatory bowel disease and drug (infliximab versus adalimumab) at study inclusion. The primary endpoint of the study is sustained clinical remission at 1 year. Other endpoints include endoscopic and radiological activity, patient-reported outcomes (quality of life, work productivity), safety and predictive factors for relapse. The required sample size is 194 patients. In addition to the main analysis (discontinuation versus continuation), subanalyses will include stratification by type of inflammatory bowel disease, phenotype and previous treatment. Biological samples will be obtained to identify factors predictive of relapse after treatment withdrawal.
Results:
Enrolment began in 2016, and the study is expected to end in 2020.
Conclusions:
This study will contribute prospective, controlled data on outcomes and predictors of relapse in patients with inflammatory bowel disease after withdrawal of anti-TNF agents following achievement of clinical remission.
Clinical trial reference number:
EudraCT 2015-001410-1
Tweeties Squabbling: Positive and Negative Results in Applying Argument Mining on Social Media